A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition

Author: Bennett Erin
Borschinger Benjamin
Chiu Justin
Church Kenneth
Clark Pascal
Dunbar Ewan
Dupoux Emmanuel
Feldman Naomi
Fourtassi Abdallah
Goldwater Sharon
Harwath David
Hermansky Hynek
Jansen Aren
Johnson Mark
Khudanpur Sanjeev
Lee Chia-ying
Levin Keith
McGraw Ian
Metze Florian
Norouzian Atta
Peddinti Vijay
Richardson Rachel
Rose Richard
Schatz Thomas
Seltzer Mike
Thomas Samuel
Varadarajan Balakrishnan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2013

We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.

Edinburgh Research Explorer

Macquarie University ResearchOnline

Techniques for two-stage open vocabulary spoken term detection and verification

Author: Norouzian Atta
Publication venue: McGill University

Spoken term detection (STD) is one of many applications that require a capability for search and retrieval of spoken content from large media repositories. In a typical STD scenario, a user enters a query term consisting of a word or phrase and, in response, the search engine returns a list of detected occurrences of the query term in the repository. State-of-the-art STD systems use an automatic speech recognition (ASR) system to generate a tokenized representation of the speech and search this representation to find hypothesized occurrences of the query terms. Varying acoustic conditions, speaker populations, and speaking styles, along with specialized task domains, all contribute to generally poor speech recognition performance in many STD scenarios. Furthermore, media repositories can be extremely large, in some cases on the order of thousands of hours of audio material. These factors reduce the search accuracy and speed, respectively, of ASR-based STD systems. The objective of this thesis is to address these issues.

The work presented in this thesis makes four major contributions. The first is the development of a fast and accurate ASR-based STD approach for large audio repositories. This approach is based on efficient indexing of ASR outputs and a two-stage phoneme-based search procedure that facilitates detecting occurrences of all query terms, whether they belong to the ASR vocabulary or not. The second contribution is the development of a graph-based approach for verifying the occurrence of query terms in the set of candidate speech intervals derived from an STD system. In this approach, the confidence scores associated with the hypothesized query term occurrences, generated by the original STD system, are adjusted based on the acoustic similarity of the corresponding acoustic intervals to each other and to other intervals in the repository. The third contribution is the use of a feature representation and modeling formalism, distinct from those used in conventional ASR systems, for generating alternative confidence scores for a given set of hypothesized query term occurrences. It is shown that the resulting confidence scores are complementary to the confidence scores estimated in conventional ASR-based STD systems. The fourth contribution is the development of two manifold-based semi-supervised approaches for verifying hypothesized occurrences of query terms. It is demonstrated that using unlabeled data in addition to labeled data when training term-dependent models under the proposed semi-supervised framework improves the verification accuracy. Moreover, in extremely low-resource scenarios, reasonably good STD performance is achieved by exploiting only the similarity of the hypothesized query term occurrences, using a semi-supervised approach based on graph spectral clustering.

eScholarship@McGill